Method and apparatus for tracking pitch in audio analysis

ABSTRACT

A computationally efficient and robust pitch detection and tracking system and related methods are presented. According to certain exemplary implementations a method is presented comprising identifying an initial set of pitch period candidates using a first estimation algorithm, filtering the initial set of candidates and passing the filtered candidates through a second, more accurate pitch estimation algorithm to generate a final set of pitch period candidates from which the most likely pitch value is selected.

TECHNICAL FIELD

[0001] This invention generally relates to speech recognition systems and, more particularly, to a method and apparatus for tracking pitch in the analysis of audio content.

BACKGROUND

[0002] Recent advances in computing power and related technology have fostered the development of a new generation of powerful software applications including web-browsers, word processing and speech recognition applications. Newer speech recognition applications similarly offer a wide variety of features with impressive recognition and prediction accuracy rates. In order to be useful to an end-user, however, these features must execute in substantially real-time.

[0003] Despite the advances in computing system technology, achieving real-time performance in speech recognition systems remains quite a challenge. Often, speech recognition systems must trade off performance against accuracy. Accurate speech recognition systems typically rely on digital signal processing algorithms and complex statistical models, generated from large speech and textual corpora.

[0004] In addition to the computational complexity of the language model, another challenge to accurate speech recognition is to accurately model and predict the voice characteristics of the speaker. Indeed, in certain languages, the meaning of a word depends on the tone of the word, i.e., the pitch of the speech. Many oriental languages are tonal languages, wherein the meaning of a word is partially conveyed in the pitch (or tone) in which it is presented. Thus, speech recognition for such tonal languages must include a pitch tracking algorithm that can track changes in pitch (tone) in near real-time. As with the language model above, for very large vocabulary continuous speech recognition systems, in order to be useful, a pitch tracking system must be fast while providing an accurate estimate of fundamental frequency. Unfortunately, in order to provide acceptably accurate results, conventional pitch tracking systems are often slow, as the algorithms which analyze and track voice content for fundamental pitch values are computationally expensive and time consuming, and thus unsuited for real-time interactive applications such as, for example, a computer interface technology.

[0005] Thus, a method and apparatus for pitch tracking in audio analysis applications is required, unencumbered by the deficiencies and limitations commonly associated with prior art language modeling techniques.

SUMMARY

[0006] In accordance with certain exemplary implementations, a method is presented comprising identifying an initial set of pitch period candidates using a fast first pass pitch estimation algorithm, filtering the initial set of candidates and passing the filtered candidates through a second, more accurate pitch estimation algorithm to generate a final set of pitch period candidates from which the most likely pitch value is selected. It will be appreciated that the dual pass pitch tracker, using two different, increasingly complex pitch estimation algorithms on a decreasing pitch candidate sample, provides near real-time capability while limiting degradation in accuracy.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] The same reference numbers are used throughout the figures to reference like components and features.

[0008] FIG. 1 is a block diagram of an example computing system;

[0009] FIG. 2 is a block diagram of an example audio analyzer, in accordance with the teachings of the present invention;

[0010] FIG. 3 is a block diagram of an example dual-pass pitch tracking module, according to certain aspects of the present invention;

[0011] FIG. 4 is a graphical illustration of an example waveform of audio content broken into individual pitch periods;

[0012] FIG. 5 is a graphical illustration of a chart depicting the digitized spectrum of each of the pitch periods, from which the pitch tracking module calculates the relative probability for transition between discrete candidates within each pitch period;

[0013] FIG. 6 is a flow chart of an example method for tracking pitch in substantially real-time, according to certain aspects of the present invention; and

[0014] FIG. 7 is a graphical illustration of an example storage medium including instructions which, when executed, implement the teachings of the present invention, according to certain implementations of the present invention.

DETAILED DESCRIPTION

[0015] This invention concerns a method and apparatus for detecting and tracking pitch in support of audio content analysis. As disclosed herein, the invention is described in the broad general context of computing systems of a heterogeneous network executing program modules to perform one or more tasks. Generally, these program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. In this case, the program modules may well be included within the operating system or basic input/output system (BIOS) of a computing system to facilitate the streaming of media content through heterogeneous network elements.

[0016] As used herein, the working definition of computing system is quite broad, as the teachings of the present invention may well be advantageously applied to a number of electronic appliances including, but not limited to, hand-held devices, communication devices, KIOSKs, personal digital assistants, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, wired network elements (routers, hubs, switches, etc.), wireless network elements (e.g., base stations, switches, control centers), and the like. It is noted, however, that modification to the architecture and methods described herein may well be made without deviating from the spirit and scope of the present invention.

[0017] Example Computing Environment

[0018] FIG. 1 illustrates an example of a suitable computing environment 100 within which to practice the innovative audio analyzer of the present invention. It should be appreciated that computing environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the streaming architecture. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary computing environment 100.

[0019] The example computing system 100 is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may well benefit from the heterogeneous network transport layer protocol and dynamic, channel-adaptive error control schemes described herein include, but are not limited to, personal computers, server computers, thin clients, thick clients, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, wireless communication devices, wireline communication devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

[0020] Certain features supporting the dual-pass pitch tracking module of the innovative audio analyzer may well be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.

[0021] As shown in FIG. 1, the computing environment 100 includes a general-purpose computing device in the form of a computer 102. The components of computer 102 may include, but are not limited to, one or more processors or execution units 104, a system memory 106, and a bus 108 that couples various system components including the system memory 106 to the processor 104.

[0022] As shown, system memory 106 includes computer readable media in the form of volatile memory 110, such as random access memory (RAM), and/or non-volatile memory 112, such as read only memory (ROM). The non-volatile memory 112 includes a basic input/output system (BIOS), while the volatile memory typically includes an operating system 126, application programs 128 such as, for example, audio analyzer 129, other program modules 130 and program data 132. Insofar as the instructions and data stored in volatile memory are lost when power is removed from the computing system, such information is commonly stored in a non-volatile mass storage such as removable/non-removable, volatile/non-volatile computer storage media 116, accessible via data media interface 124. By way of example only, a hard disk drive, a magnetic disk drive (e.g., a “floppy disk”), and/or an optical disk drive may also be implemented on computing system 102 without deviating from the scope of the invention. Moreover, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, random access memories (RAMs), read only memories (ROM), and the like, may also be used in the exemplary operating environment.

[0023] Bus 108 is intended to represent one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnects (PCI) bus also known as Mezzanine bus.

[0024] A user may enter commands and information into computer 102 through input devices such as keyboard 134 and/or a pointing device (such as a “mouse”) 136 via an input/output interface(s) 140. Other input devices 138 may include a microphone, joystick, game pad, satellite dish, serial port, scanner, or the like, coupled to bus 108 via input/output (I/O) interface(s) 140.

[0025] Display device 142 is intended to represent any of a number of display devices known in the art. A monitor or other type of display device 142 is typically connected to bus 108 via an interface, such as a video adapter 144. In addition to the monitor, certain computer systems may well include other peripheral output devices such as speakers (not shown) and printers 146, which may be connected through output peripheral interface(s) 140.

[0026] As shown, computer 102 may operate in a networked environment using logical connections to one or more remote computers via one or more I/O interface(s) 140 and/or network interface(s) 154.

[0027] Example Audio Analyzer

[0028] FIG. 2 illustrates a block diagram of an example audio analyzer 129, which selectively implements one or more elements of a dual-pass pitch tracking system (FIG. 3), to be discussed more fully below. Although introduced as a stand-alone element within computing system 100, it is to be appreciated that audio analyzer 129 may well be integrated with or leveraged by any of a host of applications (e.g., a speech recognition system) to provide substantially real-time pitch tracking capability to such applications.

[0029] In accordance with the illustrated exemplary implementation of FIG. 2, audio analyzer 129 is depicted comprising one or more controllers 202, memory 204, an audio analysis engine 206, network communication interface(s) 208 and one or more applications (e.g., graphical user interface, speech recognition application, language conversion application, etc.) 210, each communicatively coupled as shown. It will be appreciated that although depicted in FIG. 2 as a number of disparate blocks, one or more of the functional elements of the audio analyzer 129 may well be combined/integrated into multifunction modules. Moreover, although depicted in accordance with a hardware paradigm, those skilled in the art will appreciate that this is for ease of explanation only, and that such functional modules may well be implemented in software and/or firmware without deviating from the spirit and scope of the present invention.

[0030] As alluded to above, although depicted as a separate functional element, audio analyzer 129 may well be implemented as a function of a higher-level application, e.g., a word processor, web browser, speech recognition system, or a language conversion system. In this regard, controller(s) 202 of analyzer 129 are responsive to one or more instructional commands from a parent application to selectively invoke the pitch tracking features of audio analyzer 129. Alternatively, analyzer 129 may well be implemented as a stand-alone analysis tool, providing a user with a user interface (e.g., 210) to selectively implement the pitch tracking features of audio analyzer 129, discussed below.

[0031] In either case, controller(s) 202 of analyzer 129 receive audio input and selectively invoke one or more functions of analysis engine 206 (described more fully below) to identify a most likely fundamental frequency within each of a plurality of frames of parsed audio input. According to one implementation, the audio content is received into memory 204, which then supplies audio analysis engine 206 with select subsets of the received audio, as controlled by controller(s) 202. Alternatively, controller 202 may well direct received audio content directly to the audio analysis engine 206 for pitch tracking analysis.

[0032] Except as configured to effect the teachings of the present invention, controller 202 is intended to represent any of a number of alternate control systems known in the art including, but not limited to, a microprocessor, a programmable logic array (PLA), a micro-machine, an application specific integrated circuit (ASIC) and the like. In an alternate implementation, controller 202 is intended to represent a series of executable instructions to implement the control logic described above.

[0033] As shown, the innovative audio analysis engine 206 is comprised of at least a dual-pass pitch tracking module 212. In certain implementations, the audio analysis engine 206 may also be endowed with another functional element which leverages the features of the innovative dual-pass pitch tracking module 212 to foster different audio analyses such as, for example, speech recognition. In this regard, audio analysis engine 206 is depicted comprising syllable recognition module 216.

[0034] As used herein, syllable recognition module 216 is depicted to illustrate that other functional elements may well be implemented within (or external to) audio analysis engine 206 to leverage the pitch detection attributes of dual-pass pitch tracking module 212. In accordance with the illustrated exemplary implementation, syllable recognition module 216 analyzes received audio content to detect phonemes, the smallest audio element of verbal communication, and compares the detected phonemes against a language model in an attempt to detect the content of verbal communication. When implemented in conjunction with the innovative dual-pass pitch tracking module 212, the syllable recognition module 216 utilizes the pitch tracking features to discern audio content in tonal language input. It is to be appreciated that the dual-pass pitch tracking module 212 functions independently of syllable recognition module 216. Indeed, audio analysis engine 206 may well be endowed with other audio analysis functions that leverage the pitch tracking features of dual-pass pitch tracking module 212 in place of, or in addition to, syllable recognition module 216.

[0035] As will be described more fully below, dual-pass pitch tracking module 212 receives audio content, pre-processes it to parse the audio content into frames, and proceeds to pass the frames of audio content through a first and second pitch estimation module to identify the fundamental frequency of the audio content within each frame. That is, dual-pass pitch tracking module 212 implements two separate pitch estimation modules to identify the fundamental frequency of a frame of audio content. One exemplary architecture for just such a dual-pass pitch tracking module 212 is presented below, with reference to FIG. 3.

[0036] In addition to the foregoing, audio analyzer 129 also includes one or more network communication interface(s) 208 and may also include one or more applications 210. According to one implementation, network interface(s) 208 enable audio analyzer 129 to interface with external elements such as, for example, external applications, external hardware elements, one or more internal busses of a host computing system and/or one or more inter-computing system networks (e.g., local area network (LAN), wide area network (WAN), global area network (Internet), and the like). As used herein, network interface(s) 208 is intended to represent any of a number of network interface(s) known in the art and, therefore, need not be further described.

[0037] Turning to FIG. 3, a block diagram of an example dual-pass pitch tracking module is presented, in accordance with certain exemplary implementations of the present invention. In accordance with the illustrated exemplary implementation of FIG. 3, dual-pass pitch tracking module 212 is presented comprising a pre-processing module 302, a first pitch estimation module 304, a second pitch estimation module 308, a zero crossing/energy detection module 310 and one or more filters 316, each coupled as shown. It should be noted that pre-processing module 302 is depicted herein using a lighter, hashed line to denote that the dual-pass pitch tracking module may well function without pre-processing. As used herein, pre-processing module 302 parses the received audio content into frames of audio content. According to one implementation, the frame size is pre-defined to ten (10) milliseconds worth of audio content. In alternate implementations, other frame sizes may well be used, or the frame size may well be dynamically set based, at least in part, on one or more features of the received audio content, e.g., overall duration of audio, sampling rate, dynamic range, etc.

[0038] In addition to parsing the received audio content, pre-processing module 302 beneficially removes some background noise and some components of the received audio content with unreasonable frequencies in the frequency domain. In this regard, pre-processing module 302 may well implement some filtering functions to remove such undesirable audio content. In addition, pre-processing module 302 estimates and removes a direct-current (DC) bias from each of the frames before passing the content to the pitch estimation modules.
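By way of illustration only, the framing and DC-bias removal attributed to pre-processing module 302 might be sketched as follows. The function names, the use of non-overlapping frames, and the use of the frame mean as the DC estimate are assumptions made for this sketch; the noise filtering mentioned above is not shown.

```python
def split_into_frames(samples, sample_rate, frame_ms=10):
    """Split samples into non-overlapping frames of frame_ms milliseconds.

    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def remove_dc_bias(frame):
    """Subtract the frame mean, used here as a simple DC-bias estimate."""
    mean = sum(frame) / len(frame)
    return [x - mean for x in frame]
```

At a 16 kHz sampling rate, the ten millisecond frame size corresponds to 160 samples per frame.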

[0039] Once parsed, each frame of the audio content is passed through a first pitch estimation module 304, filtered, and then passed through a second pitch estimation module 308 before additional filtering and smoothing 316 to reveal a probable fundamental frequency (pitch value) 320 for the frame. According to one implementation, the first pitch estimation module 304 implements a fast pitch estimation algorithm to identify an initial set of pitch value candidates. The plethora of pitch value candidates identified by the first pitch estimation module are then filtered to a more manageable number of candidates 306, which are passed through a second pitch estimation module 308.

[0040] According to one implementation, the second pitch estimation module 308 implements a more accurate pitch estimation algorithm than the first pitch estimation algorithm. In this regard, the increased computational complexity of the second estimation module 308 may slow the performance of the module when compared to the first 304. Insofar as the second pitch estimation module is acting on a smaller sample size (i.e., the filtered candidates 306 from the first pitch estimation module 304), however, its processing time is about the same as or slightly less than the processing time required by the first module 304. In this regard, the dual-pass pitch detection module 212 functions to provide an accurate and fast pitch detection capability, suitable for applications requiring substantially real-time pitch detection.

[0041] According to one implementation, to be described more fully below, the first pitch estimation module 304 implements an average magnitude difference function (AMDF) pitch estimation algorithm, presented mathematically in equation 1, below.

$$D_{i,k} = \sum_{j=m}^{m+n-1} \left| s_{j} - s_{j+k} \right|, \quad k = 0, 1, \ldots, K-1 \qquad (1)$$

[0042] where: $s_{j}$ and $s_{j+k}$ are the $j$-th and $(j+k)$-th samples in the speech waveform, and $D_{i,k}$ represents the similarity of the $i$-th speech frame and its adjacent neighbor with an interval of $k$ samples.

[0043] The AMDF pitch estimation algorithm derives its performance capability from the fact that it is performing a subtraction operation which, those skilled in the art will appreciate, is faster to execute than other more complex operations such as multiplication, division, logarithmic functions, and the like. Thus, even though the first pitch estimation module 304 is acting on the entire sample, implementation of the AMDF algorithm nonetheless enables module 304 to perform this function quite rapidly.
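The AMDF of equation (1) might be sketched as follows. This is a minimal illustration rather than a definitive implementation; it assumes that m is the index of the first sample of the frame, n is the number of samples summed, and max_lag plays the role of K. The function name and parameter names are hypothetical.

```python
def amdf(samples, m, n, max_lag):
    """Return [D_0, ..., D_{K-1}] per equation (1), with K = max_lag.

    Assumption: m is the first sample index of the frame and n the number of
    samples summed. Only subtraction, absolute value and addition are used.
    """
    if m + n - 1 + max_lag - 1 >= len(samples):
        raise ValueError("not enough samples for the requested lags")
    return [sum(abs(samples[j] - samples[j + k]) for j in range(m, m + n))
            for k in range(max_lag)]
```

For a perfectly periodic signal, the AMDF value falls to zero at lags equal to a multiple of the period, so low values mark likely pitch periods.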

[0044] As introduced above, the AMDF algorithm is employed by pitch estimation module 304 to find potential pitch value candidates within a frame shift range of 2 ms to 20 ms. According to certain exemplary implementations, N possible pitch values are estimated, where N is based, at least in part, on the speech sampling rate (R), wherein N=(shift time range)*R. For example, in the case where the speech sampling rate (R) is 16 kHz, N=288 pitch values are calculated and filtered, to provide an initial set of M pitch value candidates (306) to the second pitch estimation module 308. In accordance with the illustrated exemplary implementation, N>>M. The M top candidates are selected by sorting the possible pitch candidates according to the AMDF score in the current frame and selecting the top M candidates in this implementation.
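The candidate count and the top-M filtering described above might be sketched as follows. The sketch assumes that a lower AMDF value indicates a better candidate (the ranking direction is not stated explicitly) and that the lag range is half-open; the function names are hypothetical.

```python
def candidate_lag_range(sample_rate, min_ms=2.0, max_ms=20.0):
    """Return (first_lag, end_lag) in samples for the 2 ms to 20 ms shift range.

    Assumption: the range is half-open, giving end_lag - first_lag candidates.
    """
    first_lag = round(sample_rate * min_ms / 1000)
    end_lag = round(sample_rate * max_ms / 1000)
    return first_lag, end_lag


def top_candidates(amdf_scores, first_lag, m):
    """Return the m lags with the lowest AMDF score.

    amdf_scores[idx] is the score for lag first_lag + idx.
    Assumption: lower AMDF means greater similarity, hence a better candidate.
    """
    ranked = sorted(range(len(amdf_scores)), key=lambda idx: amdf_scores[idx])
    return [first_lag + idx for idx in ranked[:m]]
```

With R = 16 kHz, the range spans lags 32 through 319, i.e., N = (20 ms − 2 ms) × 16,000 samples/s = 288 candidates.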

[0045] According to one implementation, the second pitch estimation module 308 implements a normalized cross correlation (NCC) pitch estimation algorithm to re-score the top M pitch value candidates from the first pitch estimation module 304, expressed mathematically with reference to equations (2) and (3), below.

$$\varphi_{i,k} = \frac{\sum_{j=m}^{m+n-1} s_{j} s_{j+k}}{\sqrt{e_{m} e_{m+k}}}, \quad k = 0, 1, \ldots, K-1; \quad i = 0, 1, \ldots, M-1 \qquad (2)$$

[0046] where:

$$e_{m} = \sum_{l=m}^{m+n-1} s_{l}^{2} \qquad (3)$$

[0047] Because the value of the NCC pitch estimation function is independent of the amplitude of adjacent audio frames, the second pitch estimation module 308 overcomes the accuracy shortcomings of other pitch estimators, but at a cost of computational complexity. Accordingly, as implemented herein, the second pitch estimation module 308 receives a smaller sample size to act upon than does the first pitch estimation module 304, i.e., N>>M. The result is a computationally efficient yet accurate pitch tracking module 212.
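The NCC re-scoring of equations (2) and (3) might be sketched as follows, applied only to the M lags that survive the first pass. As before, m is assumed to be the first sample index of the frame and n the number of samples summed; the function names and the zero-energy fallback are assumptions of this sketch.

```python
import math


def ncc(samples, m, n, lag):
    """Normalized cross correlation of equation (2) for a single lag."""
    num = sum(samples[j] * samples[j + lag] for j in range(m, m + n))
    e_m = sum(samples[l] ** 2 for l in range(m, m + n))               # equation (3)
    e_mk = sum(samples[l] ** 2 for l in range(m + lag, m + lag + n))  # e_{m+k}
    denom = math.sqrt(e_m * e_mk)
    # Assumption: a silent (zero-energy) window is given a neutral score of 0.
    return num / denom if denom > 0 else 0.0


def rescore(samples, m, n, candidate_lags):
    """Re-score only the filtered candidates from the first pass."""
    return {lag: ncc(samples, m, n, lag) for lag in candidate_lags}
```

Because the numerator and denominator scale together, multiplying the waveform by a constant gain leaves the score unchanged, which is the amplitude independence noted above.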

[0048] The result of the second pitch estimation module 308, i.e., the re-scored candidates, is passed through dynamic programming and smoothing module 316, which selects the best primary pitch and voicing state candidates at each frame based, at least in part, on a combination of local and transition costs. As used herein, the “local cost” is the pitch candidate ranking score generated through the dual pass pitch estimation modules 304, 308. The “transition costs” include one or more ratios of energy, zero crossing rate, Itakura distances and the difference of fundamental frequency between the current and adjacent audio frames 318 computed in module 310. Exemplary formulations of “transition costs” are provided below in equations (4), (5), (6), and (7).

[0049] Firstly, we assume the length of each speech waveform frame is T. For the k-th frame, we define the following variables:

$$\mathrm{rms}(k) = \sum_{i = k \cdot T + 1}^{(k+1) \cdot T} x_{i}^{2}$$

$$\mathrm{rr}(k) = \mathrm{rms}(k) / \mathrm{rms}(k-1)$$

$$\mathrm{Pow}(k) = \alpha_{k}^{T} R_{k} \alpha_{k}$$

$$S(k) = \mathrm{Pow}(k) / \mathrm{Pow}(k-1)$$

$$\mathrm{zcross}(k) = \text{the number of zero crossings in this frame}$$

$$\mathrm{cc}(k) = \mathrm{zcross}(k) / \mathrm{zcross}(k-1)$$

$$\mathrm{SNR}(k) = \mathrm{rms}(k) / \overline{\mathrm{rms}}$$

[0050] where $x_{i}$ is the amplitude of the speech waveform at time $i$, and rr(k) > 1 if the k-th frame of the signal is located at the beginning of a voiced segment; otherwise, rr(k) < 1. $\alpha_{k}$ is the vector of linear prediction coefficients, and $R_{k}$ is the autocorrelation matrix; the k-th frame is similar to the (k−1)-th frame if S(k) is close to 1. cc(k) is the zero-crossing rate ratio, and it will be larger than 1 on a transition from a voiced or silent segment to an unvoiced segment. $\overline{\mathrm{rms}}$ is the average energy of the background, and SNR(k) is the signal-to-noise ratio of the frame.
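The frame-level quantities rms(k), rr(k), zcross(k), cc(k) and SNR(k) might be computed as in the following sketch. Pow(k) and S(k) require linear prediction analysis and are omitted. The function names, the sign convention for counting zero crossings, and the small eps guard against division by zero are assumptions of this sketch.

```python
def frame_energy(frame):
    """rms(k) as defined above: the sum of squared amplitudes in the frame."""
    return sum(x * x for x in frame)


def zero_crossings(frame):
    """zcross(k): count sign changes between consecutive samples.

    Assumption: a sample equal to zero is treated as non-negative.
    """
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def transition_features(prev_frame, cur_frame, background_energy, eps=1e-12):
    """Return rr(k), cc(k) and SNR(k) for the current frame.

    eps is a hypothetical guard against division by zero.
    """
    return {
        "rr": frame_energy(cur_frame) / max(frame_energy(prev_frame), eps),
        "cc": zero_crossings(cur_frame) / max(zero_crossings(prev_frame), eps),
        "snr": frame_energy(cur_frame) / max(background_energy, eps),
    }
```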

[0051] In the dynamic programming procedure, four kinds of transition cost should be considered:

[0052] 1. cost A: from voiced segment to voiced one.

[0053] 2. cost B: from unvoiced segment to voiced one.

[0054] 3. cost C: from voiced segment to unvoiced one.

[0055] 4. cost D: from unvoiced segment to unvoiced one.

[0056] In fact, we assume each frame of the signal can be either voiced or unvoiced, and calculate the cost in every possible case. Finally, we determine the pitch value with the optimal cost (in this case, the optimal cost is the maximum cost consisting of the transition cost or value and the NCC value).

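[0057] The formula for each kind of transition cost is listed as follows:

$$\mathrm{Trans}_{A} = W_{a1} \cdot \mathrm{abs}(\mathrm{Candidate}(k) - \mathrm{Candidate}(k-1)) \qquad (4)$$

$$\mathrm{Trans}_{B} = W_{b1} \cdot \mathrm{abs}(\mathrm{rr}(k) \cdot S(k)) + W_{b2} \cdot \mathrm{cc}(k) + W_{b3} / \mathrm{SNR}(k) \qquad (5)$$

$$\mathrm{Trans}_{C} = W_{c1} \cdot \mathrm{abs}(\mathrm{rr}(k) \cdot S(k)) + W_{c2} \cdot (\mathrm{rr}(k) - 1) + W_{c3} \cdot \mathrm{cc}(k) \qquad (6)$$

$$\mathrm{Trans}_{D} = W_{d1} + W_{d2} \cdot \mathrm{Log}(S(k)) \qquad (7)$$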

[0058] In the above formulas, all terms named W* are constants that may be determined by experiment.
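Equations (4) through (7) translate directly into code, as in the following sketch. The weights are passed in as parameters because their values are to be determined by experiment; any values used with these functions are placeholders. The use of the natural logarithm for Log is an assumption, as are the function names.

```python
import math


def trans_a(cand_k, cand_prev, w_a1):
    """Equation (4): voiced to voiced."""
    return w_a1 * abs(cand_k - cand_prev)


def trans_b(rr, s, cc, snr, w_b1, w_b2, w_b3):
    """Equation (5): unvoiced to voiced."""
    return w_b1 * abs(rr * s) + w_b2 * cc + w_b3 / snr


def trans_c(rr, s, cc, w_c1, w_c2, w_c3):
    """Equation (6): voiced to unvoiced."""
    return w_c1 * abs(rr * s) + w_c2 * (rr - 1) + w_c3 * cc


def trans_d(s, w_d1, w_d2):
    """Equation (7): unvoiced to unvoiced. Assumption: Log is the natural log."""
    return w_d1 + w_d2 * math.log(s)
```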

[0059] Example Waveform and Pitch Tracking Result

[0060] FIGS. 4 and 5 are presented to illustrate the functional operation of dual-pass pitch tracking module 212. With initial reference to FIG. 4, an illustration of an example audio waveform 400 is presented. For ease of illustration, three (3) periods of the waveform are illustrated, i.e., P₀, P₁ and P₂. The period of an audio signal is not to be confused with frame size selection, i.e., one period of a signal does not necessarily equate to a parsed frame. Signals such as the one depicted in FIG. 4 are applied to dual-pass pitch tracking module 212, which extracts pitch value information, and tracks such information across frames.

[0061] The pitch selection and tracking features of pitch detection module 212 are graphically illustrated with reference to FIG. 5. With brief reference to FIG. 5, a spectral diagram of the identified pitch values within each of a number of frames is depicted, wherein the solid line between pitch value candidates denotes those candidates that were selected as the most likely candidates based, at least in part, on the local and transition costs.

[0062] Example Operation and Implementation

[0063] Having introduced the functional and architectural elements of the dual-pass pitch tracking module 212, an example operation and implementation is developed with reference to FIG. 6. For ease of illustration, and not limitation, the teachings of the present invention will be illustrated with continued reference to the elements of FIGS. 1-5.

[0064] FIG. 6 is a flow chart of an example method for detecting pitch values in received audio content, according to one implementation of the present invention. As shown, the method of FIG. 6 begins with block 602, wherein audio analyzer 129 receives an indication to analyze audio content. As introduced above, the indication may well be generated by a separate application, e.g., a user interface application executing on a host computing system (100), or may well come from an interface executing on audio analyzer 129 itself.

[0065] In response to receiving such an indication, audio controller 202 of audio analyzer 129 opens one or more network communication interface(s) 208 to receive the audio content. As disclosed above, according to one implementation, the audio content may well be received in memory 204 of audio analyzer 129, and is selectively fed to dual-pass pitch tracking module 212 for analysis by controller 202.

[0066] As audio analyzer 129 begins to receive audio content, controller 202 selectively invokes an instance of dual-pass pitch tracking module 212 with which to analyze the audio content and extract pitch value information. As disclosed above, according to one implementation, dual-pass pitch tracking module 212 invokes an instance of pre-processing module 302 to parse the received content into frames, eliminate any DC bias from the audio signal, and remove undesirable noise artifacts from the received signal, block 604.

[0067] In block 606, the filtered audio signal frames are provided to a first pitch estimation module 304, which identifies a first set of pitch value candidates. According to one implementation, the first pitch estimation module 304 employs an average magnitude difference function (AMDF) pitch extractor to identify N pitch value candidates. As disclosed above, the number of candidates generated (N) is based, at least in part, on the sample rate of the audio content. Once the initial N candidates are identified, the candidates are filtered, and the most probable M candidates 306 are selected for re-scoring by the second pitch estimation module 308, block 608.

[0068] Accordingly, in block 610 a second pitch estimation module 308 is invoked to re-score the M pitch value candidates. As introduced above, the second pitch value estimation module 308 employs a more robust pitch value estimation algorithm than the first pitch estimation module. An example of just such a robust pitch estimation algorithm suitable for use in the second pitch estimation module 308 is the normalized cross-correlation (NCC) pitch extractor introduced above.

[0069] As described above, passing each frame of audio content through each of the first 304 and second 308 pitch estimation modules generates a local score for each of the top pitch value candidates within each frame. In addition to the local score, dual-pass pitch tracking module 212 selectively calculates 310 a transition score 318 for each of the candidates as well. As introduced above, module 310 generates a transition score 318 based on a ratio of any of a number of signal parameters between frames of the received audio signal. The generated local and transition scores are provided to dynamic programming and smoothing module 316, which selects the best pitch value candidate based on these scores, block 612.
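The selection performed in block 612 might be sketched as a dynamic programming search over the per-frame candidates. This is a simplified illustration only: it assumes the best path is the one that maximizes the sum of local scores and transition scores, it omits the voiced/unvoiced state, and it takes the transition score as a caller-supplied function. All names are hypothetical.

```python
def best_path(local_scores, transition):
    """Pick one candidate per frame, maximizing local plus transition scores.

    local_scores: list over frames of dicts {candidate: local score}.
    transition(prev_cand, cand): score added when moving between frames.
    Assumption: higher totals are better; voicing states are not modeled.
    """
    if not local_scores:
        return []
    best = dict(local_scores[0])      # best total ending at each candidate
    back = []                         # back-pointers, one dict per later frame
    for frame in local_scores[1:]:
        new_best, pointers = {}, {}
        for cand, local in frame.items():
            prev = max(best, key=lambda p: best[p] + transition(p, cand))
            new_best[cand] = best[prev] + transition(prev, cand) + local
            pointers[cand] = prev
        best = new_best
        back.append(pointers)
    cand = max(best, key=best.get)    # best final candidate
    path = [cand]
    for pointers in reversed(back):   # trace back to the first frame
        cand = pointers[cand]
        path.append(cand)
    path.reverse()
    return path
```

For example, with a transition score that penalizes large jumps in the candidate value, the search keeps a steady pitch track even when a different candidate has a slightly higher local score in a single frame.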

[0070] It is to be appreciated that the dual-pass pitch tracking system introduced above provides an effective solution to the problem of generating accurate pitch value candidates in substantially real-time. By leveraging the speed of the first pitch estimation function and the acoustic accuracy of the second pitch estimation module, a computationally efficient and accurate pitch detection system is created.

[0071] Alternate Implementations—Computer Readable Media

[0072] Turning to FIG. 7, an implementation of one or more elements of the architecture and related methods for streaming content across heterogeneous network elements may be stored on, or transmitted across, some form of computer readable media in the form of computer executable instructions. According to one implementation, for example, instructions 702 which when executed implement at least the dual-pass pitch tracking module may well be embodied in computer-executable instructions. As used herein, computer readable media can be any available media that can be accessed by a computer. By way of example, and not limitation, computer readable media may comprise “computer storage media” and “communications media.”

[0073] As used herein, “computer storage media” include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.

[0074] “Communication media” typically embodies computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism. Communication media also includes any information delivery media.

[0075] The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.

[0076] FIG. 7 is a block diagram of a storage medium 700 having stored thereon a plurality of instructions including instructions 702 which, when executed, implement a dual-pass pitch tracking module 206 according to yet another implementation of the present invention. As used herein, storage medium 700 is intended to represent any of a number of storage devices and/or storage media known to those skilled in the art such as, for example, volatile memory devices, non-volatile memory devices, magnetic storage media, optical storage media, and the like. Similarly, the executable instructions are intended to reflect any of a number of software languages known in the art such as, for example, C++, Visual Basic, Hypertext Markup Language (HTML), Java, eXtensible Markup Language (XML), and the like. Accordingly, the software implementation of FIG. 7 is to be regarded as illustrative, as alternate storage media and software implementations are anticipated within the spirit and scope of the present invention.

[0077] Although the invention has been described in language specific to structural features and/or methodological steps, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or steps described. It will be appreciated, given the foregoing, that the teachings of the present invention extend beyond the illustrative exemplary implementations presented above.

1. A method comprising: identifying an initial set of pitch value candidates within each frame of a plurality of frames of received audio content utilizing a first pitch estimation algorithm; and reducing the initial set of pitch value candidates to a select set of pitch value candidates based, at least in part, on pitch value re-scoring utilizing a second pitch estimation algorithm, wherein the select set of pitch values are selected in substantially real-time.
2. The method according to claim 1, further comprising: calculating a transition probability between at least one of the select pitch value candidates of adjacent frames.
3. The method according to claim 2, further comprising: selecting a pitch value within each frame with the highest transition probability between adjacent frames as the pitch value for the frame.
4. The method according to claim 2, wherein the transition probability is based, at least in part, on dynamic programming configured to determine a significantly best path between different pitch candidates of adjacent frames.
5. The method according to claim 2, further comprising: smoothing a curve representing the select pitch values over a plurality of frames based, at least in part, on other information.
6. The method according to claim 5, wherein other information includes one or more of an energy value for each frame, a zero crossing rate of the audio content, and/or a vocal tract spectrum of the audio content.
7. The method according to claim 1, wherein identifying the initial set of pitch value candidates within each frame comprises: passing each frame of audio content through an average magnitude difference function (AMDF); and selecting N near-zero minima pitch values in the audio content as the initial set of pitch value candidates.
8. The method according to claim 7, wherein N is set to 288 pitch value candidates, selected as the initial set of pitch value candidates based, at least in part, on the AMDF.
9. The method according to claim 1, wherein identifying a select set of pitch values comprises: generating a local score for each of the initial set of pitch value candidates utilizing a normalized cross-correlation function (NCCF); and selecting M pitch value candidates with the highest local score.
10. The computer readable media having computer instructions for performing acts comprising: identifying an initial set of pitch value candidates within each frame of a plurality of frames of received audio content utilizing a first pitch estimation algorithm; and reducing the initial set of pitch value candidates to a select set of pitch value candidates based, at least in part, on pitch value re-scoring utilizing a second pitch estimation algorithm, wherein the select set of pitch values are selected in substantially real-time.
11. The computer readable media according to claim 10, having further computer instructions for performing acts comprising: calculating a transition probability between at least one of the select pitch value candidates of adjacent frames.
12. The computer readable media according to claim 11, having further computer instructions for performing acts comprising: selecting a pitch value within each frame with the highest transition probability between adjacent frames as the pitch value for the frame.
13. The computer readable media according to claim 11, wherein the transition probability is based, at least in part, on dynamic programming configured to determine a significantly best path between different pitch candidates of adjacent frames.
14. The computer readable media according to claim 11, having further computer instructions for performing acts comprising: smoothing a curve representing the select pitch values over a plurality of frames based, at least in part, on other information.
15. The computer readable media according to claim 14, wherein other information includes one or more of an energy value for each frame, a zero crossing rate of the audio content, and/or a vocal tract spectrum of the audio content.
16. The computer readable media according to claim 10, wherein identifying the initial set of pitch value candidates within each frame comprises: passing each frame of audio content through an average magnitude difference function (AMDF); and selecting N near-zero minima pitch values in the audio content as the initial set of pitch value candidates.
17. The computer readable media according to claim 16, wherein N is set to 288 pitch value candidates, selected as the initial set of pitch value candidates based, at least in part, on the AMDF.
18. The computer readable media according to claim 10, wherein identifying a select set of pitch values comprises: generating a local score for each of the initial set of pitch value candidates utilizing a normalized cross-correlation function (NCCF); and selecting M pitch value candidates with the highest local score.
19. An apparatus comprising logic configured to receive audio content, identify an initial set of pitch value candidates within each frame of a plurality of frames of the received audio content utilizing a first pitch estimation algorithm, and reduce the initial set of pitch value candidates to a select set of pitch value candidates based, at least in part, on pitch value re-scoring utilizing a second pitch estimation algorithm, wherein the select set of pitch values are selected in substantially real-time.
20. The apparatus according to claim 19, wherein the logic is further configured to calculate a transition probability between at least one of the select pitch value candidates of adjacent frames.
21. The apparatus according to claim 20, wherein the logic is further configured to select a pitch value within each frame with the highest transition probability between adjacent frames as the pitch value for the frame.
22. The apparatus according to claim 20, wherein the transition probability is based, at least in part, on dynamic programming configured to determine a significantly best path between different pitch candidates of adjacent frames.
23. The apparatus according to claim 20, wherein the logic is further configured to smooth a curve representing the select pitch values over a plurality of frames based, at least in part, on other information.
24. The apparatus according to claim 23, wherein the other information includes one or more of an energy value for each frame, a zero crossing rate of the audio content, and/or a vocal tract spectrum of the audio content.
25. The apparatus according to claim 19, wherein, when the logic identifies the initial set of pitch value candidates within each frame, the logic is further configured to pass each frame of audio content through an average magnitude difference function (AMDF), and select N near-zero minima pitch values in the audio content as the initial set of pitch value candidates.
26. The apparatus according to claim 25, wherein N is set to 288 pitch value candidates, selected as the initial set of pitch value candidates based, at least in part, on the AMDF.
27. The apparatus according to claim 19, wherein, when the logic identifies the select set of pitch values, the logic is further configured to generate a local score for each of the initial set of pitch value candidates utilizing a normalized cross-correlation function (NCCF), and select M pitch value candidates with the highest local score.